Abstract: The problem of distributed learning and channel 
access is considered in a cognitive network with multiple sec- 
ondary users. The availability statistics of the channels are 
initially unknown to the secondary users and are estimated using 
sensing decisions. There is no explicit information exchange or 
prior agreement among the secondary users. We propose policies 
for distributed learning and access which achieve order-optimal 
cognitive system throughput (number of successful secondary 
transmissions) under self play, i.e., when implemented at all the 
secondary users. Equivalently, our policies minimize the regret 
in distributed learning and access. We first consider the scenario 
when the number of secondary users is known to the policy, 
and prove that the total regret is logarithmic in the number of 
transmission slots. Our distributed learning and access policy 
achieves order-optimal regret by comparing to an asymptotic 
lower bound for regret under any uniformly-good learning and 
access policy. We then consider the case when the number of 
secondary users is fixed but unknown, and is estimated through 
feedback. We propose a policy in this scenario whose asymptotic 
sum regret grows slightly faster than logarithmic in the 
number of transmission slots. 

Index Terms — Cognitive medium access control, multi-armed 
bandits, distributed algorithms, logarithmic regret. 



I. Introduction 

There has been extensive research on cognitive radio networks 
in the past decade to resolve many challenges not 
encountered previously in traditional communication networks 
(see [2]). One of the main challenges is to achieve coexistence 
of heterogeneous users accessing the same part of 
the spectrum. In a typical cognitive network, there are two 
classes of transmitting users, viz., the primary users who have 
priority in accessing the spectrum and the secondary users who 
opportunistically transmit when the primary user is idle. The 
secondary users are cognitive and can sense the spectrum to 
detect the presence of a primary transmission. However, due 
to resource and hardware constraints, they can sense only a 
part of the spectrum at any given time. 

We consider a slotted cognitive system where each secondary 
user can sense and access only one orthogonal channel 
in each transmission slot (see Fig. 1). Under sensing constraints, 
it is thus beneficial for the secondary users to select 
channels with higher mean availability, i.e., channels which are 
less likely to be occupied by the primary users. However, in 
practice, the channel availability statistics are a priori unknown 
to the secondary users. 

Since the secondary users are required to sense the medium 
before transmission, can these sensing decisions be used to 
learn the channel availability statistics? If so, using these 
estimated channel availabilities, can we design channel access 
rules which maximize the transmission throughput? Designing 
provably efficient algorithms to accomplish the above goals 
forms the focus of our paper. Such algorithms need to be 
efficient, both in terms of learning and channel access. 

For any learning algorithm, there are two important performance 
criteria: convergence and regret bounds [3]. In the 
above context, we require the estimates to converge to the 
correct channel availability statistics as the number of available 
sensing decisions goes to infinity. A stronger criterion is the 
regret of a learning algorithm, which measures the speed 
of convergence. In our context, the regret is the loss in 
secondary throughput due to learning compared with knowing 
the channel statistics perfectly. Hence, it is desirable for the 
learning algorithms to have small regret. The regret is a finer 
measure of performance of a learning algorithm than the time- 
averaged throughput since a sub-linear regret (with respect to 
time) implies optimal average throughput. 

Additionally, we consider a distributed framework where 
there is no information exchange or prior agreement among 
the secondary users. This introduces additional challenges: 
it results in loss of throughput due to collisions among the 
secondary users, and there is now competition among the 
secondary users since they all tend to access channels with 
higher availabilities. It is imperative for the channel access 
policies to overcome the above challenges. Hence, a distributed 
learning and access policy experiences regret both due to 
learning of the unknown channel availabilities as well as due 
to collisions under distributed access. 

A. Our Contributions 

The main contributions of this paper are twofold. First, 
we propose two distributed learning and access policies for 
multiple secondary users in a cognitive network. Second, we 
provide performance guarantees for these policies in terms of 
regret. Overall, we prove that one of our proposed algorithms 
achieves order-optimal regret and the other achieves nearly 
order-optimal regret, where the order is in terms of the number 
of transmission slots. 

Fig. 1. Cognitive radio network with U = 4 secondary users and C = 5 
channels. A secondary user is not allowed to transmit if the accessed channel 
is occupied by a primary user. If more than one secondary user transmits in 
the same free channel, then all the transmissions are unsuccessful. 

The first policy we propose assumes that the total number 
of secondary users in the system is known while our sec- 
ond policy relaxes this requirement. Our second policy also 
incorporates estimation of the number of secondary users, in 
addition to learning of the channel availabilities and designing 
distributed access rules. We provide bounds on total regret 
experienced by the secondary users under self play, i.e., when 
implemented at all the secondary users. For the first policy, we 
prove that the regret is logarithmic, i.e., $O(\log n)$, where $n$ is 
the number of transmission slots. For the second policy, the regret 
grows slightly faster than logarithmic, i.e., $O(f(n)\log n)$, 
where we can choose any function $f(n)$ satisfying $f(n) \to \infty$ 
as $n \to \infty$. Hence, we provide performance guarantees for the 
proposed distributed learning and access policies. 

A lower bound on regret under any uniformly-good distributed 
learning policy has been derived in [4], which is also 
logarithmic in the number of transmission slots. Thus, our first 
policy (which requires knowledge of the number of secondary 
users) achieves order-optimal regret. The effects of the number 
of secondary users and the number of channels on regret are 
also explicitly characterized and verified via simulations. 

To the best of our knowledge, the exploration-exploitation 
tradeoff for learning, combined with the cooperation- 
competition tradeoffs among multiple users for distributed 
medium access have not been sufficiently examined in the 
literature before (see Section I-B for a discussion). Our 
analysis in this paper provides important engineering insights 
towards dealing with learning, competition, and cooperation 
in practical cognitive systems. 

Remark: We note some of the shortcomings of our approach. 
The i.i.d. model¹ for primary transmissions is indeed idealistic 
and in practice, a Markovian model may be more appropriate 
[5], [6]. However, the i.i.d. model is a good approximation if 
the time slots for transmissions are long and/or the primary 
traffic is highly bursty. Moreover, the i.i.d. model is not crucial 
towards deriving regret bounds for our proposed schemes. 

¹By i.i.d. primary transmission model, we do not mean the presence of 
a single primary user, but rather, this model is used to capture the overall 
statistical behavior of all the primary users in the system. 



Extensions of the classical multi-armed bandit problem to 
a Markovian model are considered in [7]. In principle, our 
results on distributed learning and access can be similarly 
extended to a Markovian channel model but this entails more 
complex estimators and rules for evaluating the exploration- 
exploitation tradeoffs of different channels and is a topic of 
interest for future investigation. 

B. Related Work 

Several results on the multi-armed bandit problem will be 
used and generalized to study our problem. Detailed discussion 
on multi-armed bandits can be found in [8]-[11]. Cognitive 
medium access is a topic of extensive research; see [12] 
for an overview. The connection between cognitive medium 
access and the multi-armed bandit problem is explored in [13], 
where a restless bandit formulation is employed. Under this 
formulation, indexability is established, the Whittle's index 
for channel selection is obtained in closed-form, and the 
equivalence between the myopic policy and the Whittle's 
index is established. However, this work assumes known 
channel availability statistics and does not consider competing 
secondary users. The work in [14] considers allocation of two 
users to two channels under Markovian channel model using 
a partially observable Markov decision process (POMDP) 
framework. The use of collision feedback information for 
learning, and spatial heterogeneity in spectrum opportunities 
were investigated. However, the difference from our work 
is that [14] assumes that the availability statistics (transition 
probabilities) of the channels are known to the secondary users 
while we consider learning of unknown channel statistics. 
The works in [15], [16] consider centralized access schemes 
in contrast to distributed access here, while [17] considers access 
through information exchange and studies the optimal choice 
of the amount of information to be exchanged given the 
cost of negotiation. [18] considers access under Q-learning 
for two users and two channels where users can sense both 
the channels simultaneously. The work in [19] discusses a 
game-theoretic approach to cognitive medium access. In [20], 
learning in congestion games through multiplicative updates 
is considered and convergence to weakly-stable equilibria 
(which reduces to the pure Nash equilibrium for almost all 
games) is proven. However, the work assumes fixed costs (or 
equivalently rewards) in contrast to random rewards here, and 
that the players can fully observe the actions of other players. 

Recently, the work in [21] considers combinatorial bandits, 
where a more general model of different (unknown) channel 
availabilities is assumed for different secondary users, and a 
matching algorithm is proposed for jointly allocating users 
to channels. The algorithm is guaranteed to have logarithmic 
regret with respect to number of transmission slots and poly- 
nomial storage requirements. A decentralized implementation 
of the proposed algorithm is also given, but it still requires 
information exchange and coordination among the users. In 
contrast, we propose algorithms which remove this requirement, 
albeit in a more restrictive setting. 

In our recent work [1], we first formulated the problem 
of decentralized learning and access for multiple secondary 
users. We considered two scenarios: one where there is initial 
common information among the secondary users in the form of 
pre-allocated ranks, and the other where no such information 
is available. In this paper, we analyze the distributed policy in 
detail and prove that it has logarithmic regret. In addition, we 
also consider the case when the number of secondary users is 
unknown, and provide bounds on regret in this scenario. 

Recently, Liu and Zhao [4] proposed a family of distributed 
learning and access policies known as time-division fair share 
(TDFS), and proved logarithmic regret for these policies. They 
established a lower bound on the growth rate of system regret 
for a general class of uniformly-good decentralized policies. 
The TDFS policies in [4] can incorporate any order-optimal 
single-player policy while our work here is based on the 
single-user policy proposed in [11]. Another difference is that 
in [4], the users orthogonalize via settling at different offsets 
in their time-sharing schedule, while in our work here, users 
orthogonalize into different channels. Moreover, the TDFS 
policies ensure that each player achieves the same time-average 
reward while our policies here achieve probabilistic 
fairness, in the sense that the policies do not discriminate 
between different users. In [22], the TDFS policies are extended 
to incorporate imperfect sensing. 

Organization & Suggested Reading: Section II deals with 
the system model. Section III deals with the special case of 
single secondary user and of multiple users with centralized 
access which can be directly solved using the classical results 
on multi-armed bandits. In Section IV, we propose a distributed 
learning and access policy with provably logarithmic regret 
when the number of secondary users is known. Section V 
considers the scenario when the number of secondary users is 
unknown. Section VI provides a lower bound for distributed 
learning. Section VII has simulation results for the proposed 
schemes and Section VIII concludes the paper. Most of the 
proofs are found in the Appendix. 

Since Section III mostly deals with a recap of the classical 
results on multi-armed bandits, we suggest that an experienced 
reader directly jump to Section IV for the main results of this 
paper. 

II. System Model & Formulation 

Notation: For any two functions $f(n), g(n)$, $f(n) = O(g(n))$ 
if there exists a constant $c$ such that $f(n) \le c\,g(n)$ 
for all $n > n_0$ for a fixed $n_0 \in \mathbb{N}$. Similarly, $f(n) = \Omega(g(n))$ 
if there exists a constant $c'$ such that $f(n) \ge c' g(n)$ for all 
$n > n_0$ for a fixed $n_0 \in \mathbb{N}$, and $f(n) = \Theta(g(n))$ if 
$f(n) = \Omega(g(n))$ and $f(n) = O(g(n))$. Also, $f(n) = o(g(n))$ when 
$f(n)/g(n) \to 0$ and $f(n) = \omega(g(n))$ when $f(n)/g(n) \to \infty$ 
as $n \to \infty$. 

We refer to the $U$ highest entries in a vector $\mu$ as the $U$-best 
channels and the rest as the $U$-worst channels. Let $\sigma(T; \mu)$ 
denote the index of the $T$-th highest entry in $\mu$. Alternatively, 
we abbreviate $T^* := \sigma(T; \mu)$ for ease of notation. With abuse of 
notation, let $D(\mu_1, \mu_2) := D(B(\mu_1); B(\mu_2))$ be the Kullback-Leibler 
distance between the Bernoulli distributions $B(\mu_1)$ and 
$B(\mu_2)$ [23] and let $\Delta(1, 2) := \mu_1 - \mu_2$. 



A. Sensing & Channel Models 

Let $U \ge 1$ be the number of secondary users² and $C \ge U$ 
be the number³ of orthogonal channels available for slotted 
transmissions with a fixed slot width. In each channel $i$ and slot 
$k$, the primary user transmits i.i.d. with probability $1 - \mu_i > 0$. 
In other words, let $W_i(k)$ denote the indicator variable if the 
channel is free, 

$$W_i(k) = \begin{cases} 0, & \text{channel } i \text{ occupied in slot } k, \\ 1, & \text{o.w.,} \end{cases}$$ 

and we assume that $W_i(k) \overset{\text{i.i.d.}}{\sim} B(\mu_i)$. 

The mean availability vector $\mu$ consists of mean availabilities 
$\mu_i$ of all channels, i.e., $\mu := [\mu_1, \ldots, \mu_C]$, where 
all $\mu_i \in (0, 1)$ and are distinct. $\mu$ is initially unknown 
to all the secondary users and is learnt independently over 
time using the past sensing decisions without any information 
exchange among the users. We assume that sensing for primary 
transmissions is perfect at all the users. 

Let $T_{i,j}(k)$ denote the number of slots where channel $i$ is 
sensed in $k$ slots by user $j$ (not necessarily being the sole 
occupant of that channel). The sensing variables are obtained 
as follows: at the beginning of each slot $k$, each secondary 
user $j \in U$ selects exactly one channel $i \in C$ for sensing, and 
hence, obtains the value of $W_i(k)$, indicating if the channel is 
free. User $j$ then records all the sensing decisions of each 
channel $i$ in a vector $\mathbf{X}^k_{i,j} := [X_{i,j}(1), \ldots, X_{i,j}(T_{i,j}(k))]^T$. 
Hence, $\cup_{i=1}^{C} \mathbf{X}^k_{i,j}$ is the collection of sensed decisions for user 
$j$ in $k$ slots for all the $C$ channels. 

We assume the collision model under which if two or more 
users transmit in the same channel then none of the transmissions 
go through. At the end of each slot $k$, each user $j$ receives 
acknowledgement $Z_j(k)$ on whether its transmission in the $k$-th 
slot was received. Hence, in general, any policy employed by 
user $j$ in the $(k+1)$-th slot, given by $\rho(\cup_{i=1}^{C} \mathbf{X}^k_{i,j}, \mathbf{Z}^k_j)$, is based 
on all the previous sensing and feedback results. 
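
As an illustration of the collision model, the following minimal Python sketch computes the acknowledgements for a single slot. The function name `slot_outcome` and its list-based arguments are hypothetical and chosen only for this example; it assumes perfect sensing, so a user transmits exactly when its sensed channel is free.

```python
from collections import Counter

def slot_outcome(choices, free):
    """Return per-user acknowledgements Z_j(k) for one slot.

    choices[j] is the channel sensed by user j; free[i] is W_i(k)
    (1 if channel i is free of primary traffic, 0 otherwise).
    """
    load = Counter(choices)  # number of secondary users on each channel
    acks = []
    for ch in choices:
        # A transmission succeeds only on a free channel with a sole occupant.
        acks.append(1 if free[ch] == 1 and load[ch] == 1 else 0)
    return acks
```

For instance, `slot_outcome([0, 0, 2], [1, 1, 1])` gives `[0, 0, 1]`: the two users sharing channel 0 collide, while the user alone on channel 2 succeeds.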

B. Regret of a Policy 

Under the above model, we are interested in designing 
policies $\rho$ which maximize the expected number of successful 
transmissions of the secondary users subject to the non-interference 
constraint for the primary users. Let $S(n; \mu, U, \rho)$ 
be the expected total number of successful transmissions after 
$n$ slots under $U$ secondary users and policy $\rho$. 

In the ideal scenario where the availability statistics $\mu$ are 
known a priori and a central agent orthogonally allocates the 
secondary users to the $U$-best channels, the expected number 
of successful transmissions after $n$ slots is given by 

$$S^*(n; \mu, U) := n \sum_{j=1}^{U} \mu(j^*), \tag{1}$$ 

where $\mu(j^*)$ is the $j$-th highest entry in $\mu$. 

²A user refers to a secondary user unless otherwise mentioned. 
³When $U > C$, learning availability statistics is less crucial, since all 
channels need to be accessed to avoid collisions. In this case, design of 
medium access is more crucial. 
Algorithm 1 Single User Policy $\rho^1(g(n))$ in [10]. 

Input: $\{\bar{X}_i(n)\}_{i=1,\ldots,C}$: Sample-mean availabilities after $n$ 
rounds, $g(i;n)$: statistic based on $\bar{X}_{i,j}(n)$, 
$\sigma(T; g(n))$: index of $T$-th highest entry in $g(n)$. 
Init: Sense in each channel once, $n \leftarrow C$ 
Loop: $n \leftarrow n + 1$ 

Curr_Sel $\leftarrow$ channel corresponding to highest entry in $g(n)$ 
for sensing. If free, transmit. 



It is clear that $S^*(n; \mu, U) > S(n; \mu, U, \rho)$ for any policy 
$\rho$ and finite $n$. We are interested in minimizing the regret in 
learning and access, given by 

$$R(n; \mu, U, \rho) := S^*(n; \mu, U) - S(n; \mu, U, \rho) > 0. \tag{2}$$ 

We are interested in minimizing regret under any given $\mu \in 
(0, 1)^C$ with distinct elements. 

By incorporating the collision channel model assumption 
with no avoidance mechanism⁴, the expected throughput 
under policy $\rho$ is given by 

$$S(n; \mu, U, \rho) = \sum_{i=1}^{C} \sum_{j=1}^{U} \mu(i)\, \mathbb{E}[V_{i,j}(n)],$$ 

where $V_{i,j}(n)$ is the number of times in $n$ slots where user 
$j$ is the sole user to sense channel $i$. Hence, the regret in (2) 
simplifies as 

$$R(n; \rho) = n \sum_{k=1}^{U} \mu(k^*) - \sum_{i=1}^{C} \sum_{j=1}^{U} \mu(i)\, \mathbb{E}[V_{i,j}(n)]. \tag{3}$$ 
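
To make the regret expression in (3) concrete, the following Python sketch evaluates it for given mean availabilities and sole-occupancy counts. The function names and the nested-list layout `sole_counts[i][j]` (standing for $\mathbb{E}[V_{i,j}(n)]$) are assumptions made for this illustration only.

```python
def benchmark_throughput(mu, num_users, n):
    """S*(n; mu, U): n times the sum of the U largest mean availabilities."""
    best = sorted(mu, reverse=True)[:num_users]
    return n * sum(best)

def regret(mu, num_users, n, sole_counts):
    """Regret as in (3).

    sole_counts[i][j] stands for E[V_{i,j}(n)], the expected number of slots
    in which user j is the only secondary user sensing channel i.
    """
    achieved = sum(mu[i] * sum(sole_counts[i]) for i in range(len(mu)))
    return benchmark_throughput(mu, num_users, n) - achieved
```

With `mu = [0.9, 0.5, 0.2]`, two users and `n = 100`, an orthogonal allocation to the two best channels in every slot (`sole_counts = [[100, 0], [0, 100], [0, 0]]`) yields zero regret up to floating-point rounding.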

III. Special Cases From Known Results 

We recap the bounds for the regret under the special cases 
of a single secondary user ($U = 1$) and multiple users with 
centralized learning and access by appealing to the classical 
results on the multi-armed bandit process [8]-[10]. 



A. Single Secondary User ($U = 1$) 

When there is only one secondary user, the problem of 
finding policies with minimum regret reduces to that of a 
multi-armed bandit process. Lai and Robbins [8] first analyzed 
schemes for multi-armed bandits with asymptotic logarithmic 
regret based on the upper confidence bounds on the unknown 
channel availabilities. Since then, simpler schemes have been 
proposed in [10], [11] which compute a statistic or an index for 
each arm (channel), henceforth referred to as the $g$-statistic, 
based only on its sample mean and the number of slots where 
the particular arm is sensed. The arm with the highest index is 
selected in each slot in these works. We summarize the policy 
in Algorithm 1 and denote it $\rho^1(g(n))$, where $g(n)$ is the 
vector of scores assigned to the channels after $n$ transmission 
slots. 

⁴The effect of employing CSMA-CA is not considered here although it 
can be shown that it reduces the regret and hence, the bounds we derive are 
applicable. 



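The sample-mean based policy in [11, Thm. 1] proposes an 
index for each channel $i$ and user $j$ at time $n$, given by 

$$g_j^{\text{MEAN}}(i; n) := \bar{X}_{i,j}(T_{i,j}(n)) + \sqrt{\frac{2 \log n}{T_{i,j}(n)}}, \tag{4}$$ 

where $T_{i,j}(n)$ is the number of slots where user $j$ selects 
channel $i$ for sensing and 

$$\bar{X}_{i,j}(T_{i,j}(n)) := \sum_{k=1}^{T_{i,j}(n)} \frac{X_{i,j}(k)}{T_{i,j}(n)}$$ 

is the sample-mean availability of channel $i$, as sensed by user 
$j$. 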

The statistic in (4) captures the exploration-exploitation 
tradeoff between sensing the channel with the best predicted 
availability to maximize immediate throughput and sensing 
different channels to obtain improved estimates of their availabilities. 
The sample-mean term in (4) corresponds to exploitation 
while the other term involving $T_{i,j}(n)$ corresponds to 
exploration since it penalizes channels which are not sensed 
often. The normalization of the exploration term with $\log n$ in 
(4) implies that the term is significant when $T_{i,j}(n)$ is much 
smaller than $\log n$. On the other hand, if all the channels 
are sensed $\Theta(\log n)$ number of times, the exploration terms 
become unimportant in the $g$-statistics of the channels and the 
exploitation term dominates, thereby favoring sensing of the 
channel with the highest sample mean. 
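
The index in (4) and the single-user policy of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative simulation under the i.i.d. Bernoulli channel model, not the authors' implementation; the names `g_mean` and `single_user_policy`, the seed handling, and the use of the standard `random` module are assumptions of this sketch.

```python
import math
import random

def g_mean(sample_mean, times_sensed, n):
    """Index (4): sample mean plus the exploration bonus sqrt(2 log n / T)."""
    return sample_mean + math.sqrt(2.0 * math.log(n) / times_sensed)

def single_user_policy(mu, horizon, seed=0):
    """Simulate Algorithm 1 with the index (4); return the sensing counts."""
    rng = random.Random(seed)
    num_channels = len(mu)
    counts = [0] * num_channels    # T_i(n): slots in which channel i was sensed
    free_sum = [0] * num_channels  # number of those slots in which it was free

    def sense(i):
        counts[i] += 1
        free_sum[i] += 1 if rng.random() < mu[i] else 0

    for i in range(num_channels):  # Init: sense each channel once
        sense(i)
    for n in range(num_channels + 1, horizon + 1):
        scores = [g_mean(free_sum[i] / counts[i], counts[i], n)
                  for i in range(num_channels)]
        # Sense the channel with the highest index.
        sense(max(range(num_channels), key=scores.__getitem__))
    return counts
```

Calling `single_user_policy([0.9, 0.5, 0.2], 2000)` returns one count per channel; the counts sum to the horizon, and the channel with the highest availability should receive the large majority of the sensing slots.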

The regret based on the above statistic in (4) is logarithmic 
for any finite number of slots $n$ but does not have the optimal 
scaling constant. The sample-mean based statistic in [10, 
Example 5.7] leads to the optimal scaling constant for regret 
and is given by 

$$g_j^{\text{OPT}}(i; n) := \bar{X}_{i,j}(T_{i,j}(n)) + \min\left[\sqrt{\frac{\log n}{2\, T_{i,j}(n)}},\ 1\right]. \tag{5}$$ 

In this paper, we design policies based on the $g^{\text{MEAN}}$ statistic 
since it is simpler to analyze than the $g^{\text{OPT}}$ statistic. 

We now recap the results which show logarithmic regret in 
learning the best channel. In this context, we define uniformly 
good policies $\rho$ [8] as those with regret 

$$R(n; \mu, U, \rho) = o(n^{\alpha}), \quad \forall \alpha > 0,\ \mu \in (0, 1)^C. \tag{6}$$ 



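Theorem 1 (Logarithmic Regret for $U = 1$ [10], [11]): 
For any uniformly good policy $\rho$ satisfying (6), the expected 
time spent in any suboptimal channel $i \ne 1^*$ satisfies, for any $\epsilon > 0$, 

$$\lim_{n \to \infty} \mathbb{P}\left[ T_{i,1}(n) \ge \frac{(1 - \epsilon) \log n}{D(\mu_i, \mu_{1^*})} \right] = 1, \tag{7}$$ 

where $1^*$ is the channel with the best availability. Hence, the 
regret satisfies 

$$\liminf_{n \to \infty} \frac{R(n; \mu, 1, \rho)}{\log n} \ge \sum_{i \in 1\text{-worst}} \frac{\Delta(1^*, i)}{D(\mu_i, \mu_{1^*})}. \tag{8}$$ 

The regret under the $g^{\text{OPT}}$ statistic in (5) achieves the above 
bound: 

$$\lim_{n \to \infty} \frac{R(n; \mu, 1, \rho^1(g_j^{\text{OPT}}))}{\log n} = \sum_{i \in 1\text{-worst}} \frac{\Delta(1^*, i)}{D(\mu_i, \mu_{1^*})}. \tag{9}$$ 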
Algorithm 2 Centralized Learning Policy $\rho^{\text{CENT}}$ in [9]. 

Input: $\mathbf{X}^n := \cup_{j=1}^{U} \cup_{i=1}^{C} \mathbf{X}^n_{i,j}$: Channel availability after $n$ 
slots, $g(n)$: statistic based on $\mathbf{X}^n$, 
$\sigma(T; g(n))$: index of $T$-th highest entry in $g(n)$. 
Init: Sense in each channel once, $n \leftarrow C$ 
Loop: $n \leftarrow n + 1$ 

Curr_Sel $\leftarrow$ channels with $U$-best entries in $g(n)$. If free, 
transmit. 



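The regret under the $g^{\text{MEAN}}$ statistic in (4) satisfies 

$$R(n; \mu, 1, \rho^1(g_j^{\text{MEAN}})) \le \sum_{i \in 1\text{-worst}} \Delta(1^*, i) \left[ \frac{8 \log n}{\Delta(1^*, i)^2} + 1 + \frac{\pi^2}{3} \right].$$ 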
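
As a numerical aid, the scaling constant on the right-hand side of the lower bound (8) can be evaluated directly from the mean availabilities. The sketch below uses hypothetical helper names (`kl_bernoulli`, `lai_robbins_constant`) and assumes, as in the system model, that all availabilities lie in (0, 1) and are distinct.

```python
import math

def kl_bernoulli(p, q):
    """D(p, q): Kullback-Leibler distance between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def lai_robbins_constant(mu):
    """Right-hand side of (8): sum over suboptimal channels of Delta / D."""
    best = max(mu)
    return sum((best - m) / kl_bernoulli(m, best) for m in mu if m != best)
```

For two channels with availabilities 0.9 and 0.5, the constant is roughly 0.78, so any uniformly good single-user policy incurs regret of at least about $0.78 \log n$ asymptotically.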



B. Centralized Learning & Access for Multiple Users 

We now consider multiple secondary users under centralized 
access policies where there is joint learning and access by a 
central agent on behalf of all the $U$ users. Here, to minimize 
the sum regret, the centralized policy allocates the $U$ users to 
orthogonal channels to avoid collisions. Let $\rho^{\text{CENT}}(\mathbf{X}^k)$, with 
$\mathbf{X}^k := \cup_{j=1}^{U} \cup_{i=1}^{C} \mathbf{X}^k_{i,j}$, denote a centralized policy based 
on the sensing variables of all the users. The policy under 
centralized learning is a simple generalization of the single-user 
policy and is given in Algorithm 2. We now recap the 
results of [9]. 

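Theorem 2 (Regret Under Centralized Policy $\rho^{\text{CENT}}$ [9]): 
For any uniformly good centralized policy $\rho^{\text{CENT}}$ satisfying 
(6), the expected time spent in a $U$-worst channel $i$ satisfies, for any $\epsilon > 0$, 

$$\lim_{n \to \infty} \mathbb{P}\left[ \sum_{j=1}^{U} T_{i,j}(n) \ge \frac{(1 - \epsilon) \log n}{D(\mu_i, \mu_{U^*})} \right] = 1, \tag{10}$$ 

where $U^*$ is the channel with the $U$-th best availability. Hence, 
the regret satisfies 

$$\liminf_{n \to \infty} \frac{R(n; \mu, U, \rho^{\text{CENT}})}{\log n} \ge \sum_{i \in U\text{-worst}} \frac{\Delta(U^*, i)}{D(\mu_i, \mu_{U^*})}. \tag{11}$$ 

The scheme in Algorithm 2 based on $g^{\text{OPT}}$ achieves the above 
bound: 

$$\lim_{n \to \infty} \frac{R(n; \mu, U, \rho^{\text{CENT}}(g^{\text{OPT}}))}{\log n} = \sum_{i \in U\text{-worst}} \frac{\Delta(U^*, i)}{D(\mu_i, \mu_{U^*})}. \tag{12}$$ 

The scheme in Algorithm 2 based on the $g^{\text{MEAN}}$ satisfies for 
any $n > 0$, 

$$R(n; \mu, U, \rho^{\text{CENT}}(g^{\text{MEAN}})) \le \sum_{m=1}^{U} \sum_{i \in U\text{-worst}} \sum_{k=1}^{U} \frac{\Delta(m^*, i)}{U} \left[ \frac{8 \log n}{\Delta(m^*, i)^2} + 1 + \frac{\pi^2}{3} \right]. \tag{13}$$ 

Proof: See Appendix A. □ 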

IV. Main Results 



Armed with the classical results on multi-armed bandits, we 
now design distributed learning and allocation policies. 



A. Preliminaries: Bounds on Regret 

We first provide simple bounds on the regret in (3) for any 
distributed learning and access policy $\rho$. 

Proposition 1 (Lower and Upper Bounds on Regret): The 
regret under any distributed policy $\rho$ satisfies 

$$R(n; \rho) \ge \sum_{j=1}^{U} \sum_{i \in U\text{-worst}} \Delta(U^*, i)\, \mathbb{E}[T_{i,j}(n)], \tag{14}$$ 

$$R(n; \rho) \le \mu(1^*) \left[ \sum_{j=1}^{U} \sum_{i \in U\text{-worst}} \mathbb{E}[T_{i,j}(n)] + \mathbb{E}[M(n)] \right], \tag{15}$$ 

where $T_{i,j}(n)$ is the number of slots where user $j$ selects 
channel $i$ for sensing, $M(n)$ is the number of collisions faced 
by the users in the $U$-best channels in $n$ slots, $\Delta(i, j) = \mu(i) - 
\mu(j)$ and $\mu(1^*)$ is the highest mean availability. 

Proof: See Appendix B. □ 

In the subsequent sections, we propose distributed learning 
and access policies and provide regret guarantees for the 
policies using the upper bound in (15). The lower bound in 
(14) can be used to derive a lower bound on regret for any 
uniformly-good policy. 

The first term in (15) represents the lost transmission 
opportunities due to selection of $U$-worst channels (with 
lower mean availabilities), while the second term represents 
performance loss due to collisions among the users in the 
$U$-best channels. The first term in (15) decouples among the 
different users and can be analyzed solely through the marginal 
distributions of the $g$-statistics at the users. This, in turn, can 
be analyzed by manipulating the classical results on multi-armed 
bandits [10], [11]. On the other hand, the second term 
in (15), involving collisions in the $U$-best channels, requires 
the joint distribution of the $g$-statistics at different users, which 
are correlated variables. This is intractable to analyze directly 
and we develop techniques to bound this term. 

B. $\rho^{\text{RAND}}$: Distributed Learning and Access 

We present the $\rho^{\text{RAND}}$ policy in Algorithm 3. Before describing 
this policy, we make some simple observations. If 
each user implemented the single-user policy in Algorithm 1, 
then it would result in collisions, since all the users target 
the best channel. When there are multiple users and there 
is no direct communication among them, the users need to 
randomize channel access in order to avoid collisions. At 
the same time, accessing the $U$-worst channels needs to be 
avoided since they contribute to regret. Hence, users can avoid 
collisions by randomizing access over the $U$-best channels, 
based on their estimates of the channel ranks. However, if the 
users randomize in every slot, there is a finite probability of 
collisions in every slot and this results in a linear growth of 
regret with the number of time slots. Hence, the users need 
to converge to a collision-free configuration to ensure that the 
regret is logarithmic. 

In Algorithm 3, there is adaptive randomization based 
on feedback regarding the previous transmission. Each user 
randomizes only if there is a collision in the previous slot; 
otherwise, the previously generated random rank for the user 
is retained. The estimation for the channel ranks is through 
the $g$-statistic, on lines similar to the single-user case. 

Algorithm 3 Policy $\rho^{\text{RAND}}(U, C, g_j(n))$ for each user $j$ under 
$U$ users, $C$ channels and statistic $g_j(n)$. 

Input: $\{\bar{X}_{i,j}(n)\}_{i=1,\ldots,C}$: Sample-mean availabilities at 
user $j$ after $n$ rounds, $g_j(i;n)$: statistic based on $\bar{X}_{i,j}(n)$, 
$\sigma(T; g_j(n))$: index of $T$-th highest entry in $g_j(n)$, 
$\zeta_j(i; n)$: indicator of collision at $n$-th slot at channel $i$ 
Init: Sense in each channel once, $n \leftarrow C$, Curr_Rank $\leftarrow 1$, 
$\zeta_j(i; m) \leftarrow 0$ 
Loop: $n \leftarrow n + 1$ 
if $\zeta_j$(Curr_Sel; $n - 1$) $= 1$ then 
Draw a new Curr_Rank $\sim$ Unif($U$) 
end if 
Select channel for sensing. If free, transmit. 
Curr_Sel $\leftarrow \sigma$(Curr_Rank; $g_j(n)$). 
If collision, $\zeta_j$(Curr_Sel; $n$) $\leftarrow 1$, Else $0$. 
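
The adaptive randomization described above can be sketched as a small simulation. The following Python code is a hedged illustration rather than the authors' implementation: the function name `simulate_rand_policy` is hypothetical, channels are drawn as i.i.d. Bernoulli variables, each user's initial observations are sampled independently without modelling transmissions, and a user is assumed to observe a collision only in slots where it actually transmits (its sensed channel is free).

```python
import math
import random

def simulate_rand_policy(mu, num_users, horizon, seed=0):
    """Simulate Algorithm 3 with the index (4); return (successes, collisions)."""
    rng = random.Random(seed)
    C = len(mu)
    counts = [[0] * C for _ in range(num_users)]    # T_{i,j}(n), indexed [j][i]
    free_sum = [[0] * C for _ in range(num_users)]  # free observations, [j][i]
    rank = [0] * num_users          # Curr_Rank (0-based), initially the top rank
    collided = [False] * num_users  # collision feedback from the previous slot
    successes = collisions = 0

    # Init: each user senses every channel once (transmissions not modelled here).
    for j in range(num_users):
        for i in range(C):
            counts[j][i] += 1
            free_sum[j][i] += 1 if rng.random() < mu[i] else 0

    for n in range(C + 1, horizon + 1):
        free = [rng.random() < m for m in mu]       # W_i(n) for this slot
        choice = []
        for j in range(num_users):
            if collided[j]:
                rank[j] = rng.randrange(num_users)  # redraw the rank uniformly
            g = [free_sum[j][i] / counts[j][i]
                 + math.sqrt(2.0 * math.log(n) / counts[j][i]) for i in range(C)]
            order = sorted(range(C), key=lambda i: g[i], reverse=True)
            choice.append(order[rank[j]])           # channel with that rank in g_j
        for j, ch in enumerate(choice):
            counts[j][ch] += 1
            free_sum[j][ch] += 1 if free[ch] else 0
            alone = choice.count(ch) == 1
            collided[j] = free[ch] and not alone    # seen only when transmitting
            if free[ch] and alone:
                successes += 1
            elif free[ch]:
                collisions += 1
    return successes, collisions
```

A call such as `simulate_rand_policy([0.9, 0.8, 0.7, 0.3, 0.2], 2, 3000)` returns the total number of successful transmissions and of collided transmissions; under this sketch, the collision count should stay small relative to the number of successes once the users have settled on distinct channels.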

C. Regret Bounds under $\rho^{\text{RAND}}$ 

It is easy to see that the $\rho^{\text{RAND}}$ policy ensures that the 
users are allocated orthogonally to the $U$-best channels as 
the number of transmission slots goes to infinity. The regret 
bounds on $\rho^{\text{RAND}}$ are however not immediately clear and we 
provide guarantees below. 

We first provide a logarithmic upper bound⁵ on the number 
of slots spent by each user in any $U$-worst channel. Hence, the 
first term in the bound on regret in (15) is also logarithmic. 

Lemma 1 (Time Spent in $U$-worst Channels): Under the 
$\rho^{\text{RAND}}$ scheme in Algorithm 3, the total time spent by any user 
$j = 1, \ldots, U$, in any $i \in U$-worst channel is given by 

$$\mathbb{E}[T_{i,j}(n)] \le \sum_{k=1}^{U} \left[ \frac{8 \log n}{\Delta(i, k^*)^2} + 1 + \frac{\pi^2}{3} \right]. \tag{16}$$ 

Proof: The proof is on lines similar to the proof for 
Theorem 2 given in Appendix A. □ 

We now focus on analyzing the number of collisions $M(n)$ 
in the $U$-best channels. We first give a result on the expected 
number of collisions in the ideal scenario where each user has 
perfect knowledge of the channel availability statistics $\mu$. In 
this case, the users attempt to reach an orthogonal (collision-free) 
configuration by uniformly randomizing over the $U$-best 
channels. 

The stochastic process in this case is a finite-state Markov 
chain. A state in this Markov chain corresponds to a configuration 
of $U$ (identical) users in $U$ 
channels. The number of states in the Markov chain is the 
number of compositions of $U$, given by $\binom{2U-1}{U}$ [24, Thm. 5.1]. 
The orthogonal configuration corresponds to the absorbing 
state. For any other state, consisting of more than one user 
or no user in any of the channels, the transition probability 
to any state of the Markov chain (including self transition 
probability) is uniform. For a state where certain channels 
have exactly one user, there are only transitions to states which 
consist of at least one user in that channel and the transition 
probabilities are uniform. Let $\Upsilon(U, U)$ denote the maximum 
time to absorption in the above Markov chain starting from 
any initial distribution. We have the following result. 

⁵Note that the bound on $\mathbb{E}[T_{i,j}(n)]$ in (16) holds for user $j$ even if the 
other users are using a policy other than $\rho^{\text{RAND}}$. On the other hand, to 
analyze the number of collisions $\mathbb{E}[M(n)]$ in (19), we need every user to 
implement $\rho^{\text{RAND}}$. 

Lemma 2 (# of Collisions Under Perfect Knowledge): 
The expected number of collisions under the $\rho^{\text{RAND}}$ scheme in 
Algorithm 3, assuming that each user has perfect knowledge 
of the mean channel availabilities $\mu$, is given by 

$$\mathbb{E}[M(n); \rho^{\text{RAND}}(U, C, \mu)] \le U\, \mathbb{E}[\Upsilon(U, U)] \le U \left[ \binom{2U-1}{U} - 1 \right]. \tag{17}$$ 

Proof: See Appendix C. □ 
The above result states that there is at most a finite number 
of expected collisions, bounded by $U\, \mathbb{E}[\Upsilon(U, U)]$, under perfect 
knowledge of $\mu$. In contrast, recall from the previous section 
that there are no collisions under perfect knowledge of $\mu$ 
in the presence of pre-allocated ranks. Hence, $U\, \mathbb{E}[\Upsilon(U, U)]$ 
represents a bound on the additional regret due to the lack 
of direct communication among the users to negotiate their 
ranks. 
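
The perfect-knowledge scenario can be explored numerically with a short Python sketch. The helper names below are hypothetical, and the simulation assumes that every channel is free in every slot so that all co-located users observe their collisions immediately; it is meant only to illustrate the state count and the time to reach an orthogonal configuration.

```python
import random
from math import comb

def num_configurations(num_users):
    """Number of ways to place U identical users in U channels: C(2U-1, U)."""
    return comb(2 * num_users - 1, num_users)

def absorption_time(num_users, rng):
    """Slots until all users occupy distinct channels, starting co-located."""
    position = [0] * num_users  # every user starts on the same channel
    slots = 0
    while len(set(position)) < num_users:
        # Users sharing a channel collide and redraw uniformly; sole occupants stay.
        position = [rng.randrange(num_users) if position.count(p) > 1 else p
                    for p in position]
        slots += 1
    return slots
```

For $U = 2$, each redraw separates the two users with probability 1/2, so the expected absorption time from a co-located start is 2 slots, which coincides with $\binom{3}{2} - 1 = 2$ in (17). Averaging `absorption_time(U, random.Random(seed))` over many seeds gives an empirical estimate for larger $U$.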

We use the result of Lemma 2 for analyzing the number 
of collisions under distributed learning of the unknown 
availabilities $\mu$ as follows: if we show that the users are 
able to learn the correct order of the different channels with 
only logarithmic regret, then only an additional finite expected 
number of collisions occur before reaching an orthogonal 
configuration. 

Define $T'(n; \rho^{\text{RAND}})$ as the number of slots where any one 
of the top-$U$ estimated ranks of the channels at some user is 
wrong under the $\rho^{\text{RAND}}$ policy. Below we prove that its expected 
value is logarithmic in the number of transmissions. 

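Lemma 3 (Wrong Order of $g$-statistics): Under the $\rho^{\text{RAND}}$ 
scheme in Algorithm 3, 

$$\mathbb{E}[T'(n; \rho^{\text{RAND}})] \le U \sum_{a=1}^{U} \sum_{b=a+1}^{C} \left[ \frac{8 \log n}{\Delta(a^*, b^*)^2} + 1 + \frac{\pi^2}{3} \right]. \tag{18}$$ 

Proof: See Appendix D. □ 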
We now provide an upper bound on the number of collisions 
$M(n)$ in the $U$-best channels by incorporating the above result 
on $\mathbb{E}[T'(n)]$, the result on the average number of slots $\mathbb{E}[T_{i,j}(n)]$ 
spent in the $U$-worst channels in Lemma 1, and the average 
number of collisions $U\, \mathbb{E}[\Upsilon(U, U)]$ under perfect knowledge 
of $\mu$ in Lemma 2. 

Theorem 3 (Logarithmic Number of Collisions Under $\rho^{\mathrm{RAND}}$): 
The expected number of collisions in the $U$-best channels 
under the $\rho^{\mathrm{RAND}}(U, C, g^{\mathrm{MEAN}})$ scheme satisfies 

$$\mathbb{E}[M(n)] \le U\,(\mathbb{E}[T(U, U)] + 1)\,\mathbb{E}[T'(n)]. \qquad (19)$$

Hence, from (17), (18) and (19), $\mathbb{E}[M(n)] = O(\log n)$. 
Proof: See Appendix E. □ 
Hence, there is only a logarithmic number of expected 
collisions before the users settle in the orthogonal channels. 
Combining this result with Lemma 1, which states that the number of 
slots spent in the $U$-worst channels is also logarithmic, we 
immediately have one of the main results of this paper: the 
sum regret under distributed learning and access is logarithmic. 

Theorem 4 (Logarithmic Regret Under $\rho^{\mathrm{RAND}}$): The policy 
$\rho^{\mathrm{RAND}}(U, C, g^{\mathrm{MEAN}})$ in Algorithm 3 has $\Theta(\log n)$ regret. 
Proof: Substituting (19) and (16) in (15). □ 

Hence, we prove that distributed learning and channel access 
among multiple secondary users is possible with logarithmic 
regret without any explicit communication among the users. 
This implies that the number of lost opportunities for successful 
transmissions at all secondary users is only logarithmic in 
the number of transmissions, which is negligible when the 
number of transmissions is large. 

We have so far focused on designing schemes that maximize 
system or social throughput. We now briefly discuss the 
fairness for an individual user under $\rho^{\mathrm{RAND}}$. The $\rho^{\mathrm{RAND}}$ policy does 
not distinguish among the users, in the sense that each user 
has equal probability of "settling" down in one of the $U$-best 
channels while experiencing only logarithmic regret in 
doing so. Simulations in Section VII (in Fig. 4) demonstrate 
this phenomenon. 

V. Distributed Learning and Access under 
Unknown Number of Users 

We have so far assumed that the number of secondary 
users is known, as required for the implementation of 
the $\rho^{\mathrm{RAND}}$ policy. In practice, this entails an initial announcement 
from each of the secondary users to indicate their presence in 
the cognitive network. However, in a truly distributed setting 
without any information exchange among the users, such an 
announcement may not be possible. 

In this section, we consider the scenario where the number 
of users $U$ is unknown (but fixed throughout the duration of 
transmissions, with $U \le C$, the number of channels). In this 
case, the policy needs to estimate the number of secondary 
users in the system, in addition to learning the channel 
availability statistics and designing channel access rules based 
on collision feedback. Note that if the policy assumed the 
worst-case scenario that $U = C$, then the regret would grow linearly, 
since the $U$-worst channels are selected a large number of times 
for sensing. 

A. Description of $\rho^{\mathrm{EST}}$ Policy 

We now propose a policy $\rho^{\mathrm{EST}}$ in Algorithm 4. This policy 
incorporates two functions in each transmission slot, viz., 
execution of the $\rho^{\mathrm{RAND}}$ policy in Algorithm 3 based on the 
current estimate $\hat{U}$ of the number of users, and updating of 
the estimate $\hat{U}$ based on the number of collisions experienced 
by the user. 

The updating is based on the idea that if there is 
under-estimation of $U$ at all the users ($\hat{U}_j < U$ at all the users $j$), 
collisions necessarily build up and the collision count serves 
as a criterion for incrementing $\hat{U}$. This is because after a long 
learning period, the users learn the true ranks of the channels 
and target the same set of channels. However, when there is 
under-estimation, the number of users exceeds the number of 
channels targeted by the users. Hence, collisions among the 
users accumulate, and can be used as a test for incrementing 
$\hat{U}$. 

Algorithm 4 Policy $\rho^{\mathrm{EST}}(n, C, g_j(m), \xi)$ for each user $j$ under 
$n$ transmission slots (horizon length), $C$ channels, statistic 
$g_j(m)$ and threshold functions $\xi$. 

1) Input: $\{\bar{X}_{i,j}(n)\}_{i=1,\ldots,C}$: sample-mean availabilities at user $j$; 
   $g_j(i; n)$: statistic based on $\bar{X}_{i,j}(n)$; 
   $\sigma(T; g_j(n))$: index of the $T$-th highest entry in $g_j(n)$; 
   $\zeta_j(i; n)$: indicator of collision at the $n$-th slot at channel $i$; 
   $\hat{U}$: current estimate of the number of users; 
   $n$: horizon (total number of slots for transmission). 
2) Init: Sense each channel once, $m \leftarrow C$, Curr_Rank $\leftarrow 1$, 
   $\hat{U} \leftarrow 1$, $\zeta_j(i; m) \leftarrow 0$ for all $i = 1, \ldots, C$. 
3) Loop: $m \leftarrow m + 1$, stop when $m = n$. 
4) If $\zeta_j(\text{Curr\_Sel}; m - 1) = 1$ then 
      Draw a new Curr_Rank $\sim \mathrm{Unif}(\hat{U})$. 
   end if 
   Select channel for sensing. If free, transmit. 
   Curr_Sel $\leftarrow \sigma(\text{Curr\_Rank}; g_j(m))$ 
5) $\zeta_j(\text{Curr\_Sel}; m) \leftarrow 1$ if collision, $0$ o.w. 
6) If $\sum_{a=1}^{m} \sum_{k=1}^{\hat{U}} \zeta_j(\sigma(k; g_j(m)); a) > \xi(n; \hat{U})$ then 
      $\hat{U} \leftarrow \hat{U} + 1$, $\zeta_j(i; a) \leftarrow 0$, $i = 1, \ldots, C$, $a = 1, \ldots, m$. 
   end if 

Denote the collision count used by the $\rho^{\mathrm{EST}}$ policy as 

$$\Phi_{k,j}(m) := \sum_{a=1}^{m} \sum_{b=1}^{k} \zeta_j(\sigma(b; g_j(m)); a), \qquad (20)$$

which is the total number of collisions experienced by user $j$ 
so far (till the $m$-th transmission slot) in the top-$\hat{U}_j$ channels, 
where the ranks of the channels are estimated using the 
$g$-statistics. The collision count is tested against a threshold 
$\xi(n; \hat{U}_j)$, which is a function of the horizon length† and the 
current estimate $\hat{U}_j$. When the threshold is exceeded, $\hat{U}_j$ is 
incremented, and the collision samples collected so far are 
discarded (by setting them to zero) (line 6 in Algorithm 4). 
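The estimate update in line 6 of Algorithm 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `UserCountEstimator`, its methods, and the cap of the estimate at the number of channels are all hypothetical, and the threshold function is supplied by the caller. Collisions are stored per channel since the last reset, and the count is taken over the channels currently ranked in the top positions by the g-statistics.

```python
from typing import Callable, List, Sequence


class UserCountEstimator:
    """Per-user collision bookkeeping and running estimate of the user count."""

    def __init__(self, num_channels: int, horizon: int,
                 threshold: Callable[[int, int], float]) -> None:
        self.num_channels = num_channels
        self.horizon = horizon
        self.threshold = threshold  # caller-supplied threshold xi(n; k)
        self.estimate = 1           # initial estimate of the number of users
        self.collisions: List[int] = [0] * num_channels

    def record(self, channel: int, collided: bool) -> None:
        """Store the collision indicator observed on `channel` in this slot."""
        if collided:
            self.collisions[channel] += 1

    def collision_count(self, g_stats: Sequence[float]) -> int:
        """Collisions so far in the channels ranked in the top `estimate`."""
        order = sorted(range(self.num_channels),
                       key=lambda i: g_stats[i], reverse=True)
        return sum(self.collisions[i] for i in order[: self.estimate])

    def update(self, g_stats: Sequence[float]) -> int:
        """Increment the estimate and reset counts if the threshold is exceeded."""
        exceeded = self.collision_count(g_stats) > self.threshold(
            self.horizon, self.estimate)
        # Assumption: the estimate is capped at the number of channels.
        if exceeded and self.estimate < self.num_channels:
            self.estimate += 1
            self.collisions = [0] * self.num_channels
        return self.estimate
```

A user would call `record` once per slot with its collision feedback and then `update` with its current g-statistics; discarding the counts after each increment mirrors the reset described above.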

B. Regret Bounds under $\rho^{\mathrm{EST}}$ 

We analyze regret bounds under the $\rho^{\mathrm{EST}}$ policy, where the 
regret is defined in (3). Let the maximum threshold function 
for the number of consecutive collisions under the $\rho^{\mathrm{EST}}$ policy be 
denoted by 

$$\xi^*(n; U) := \max_{k=1,\ldots,U} \xi(n; k). \qquad (21)$$

We prove that the $\rho^{\mathrm{EST}}$ policy has $O(\xi^*(n; U))$ regret when 
$\xi^*(n; U) = \omega(\log n)$, where $n$ is the number of 
transmission slots. 

The proof for the regret bound under the $\rho^{\mathrm{EST}}$ policy consists 
of two main parts. First, we prove bounds on regret conditioned on 
the event that none of the users over-estimates $U$. Second, we 
show that the probability of over-estimation at any of the users 
goes to zero asymptotically. Combined together, we obtain the 
regret bound for the $\rho^{\mathrm{EST}}$ policy. 

†In this section, we assume that the users are aware of the horizon length 
$n$ for transmission. Note that this is not a limitation and can be extended 
to the case of unknown horizon length as follows: implement the algorithm by 
fixing horizon lengths to $n_0, 2n_0, 4n_0, \ldots$ for a fixed $n_0 \in \mathbb{N}$ and discarding 
estimates from previous stages. 
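The footnote on unknown horizon length describes restarting the algorithm over stages whose lengths double. A minimal sketch of that schedule is given below; the helper `stage_horizons` and the argument `total_slots` (the number of slots that eventually elapse, unknown to the users in advance) are hypothetical names used only for illustration.

```python
from typing import List


def stage_horizons(n0: int, total_slots: int) -> List[int]:
    """Stage lengths n0, 2*n0, 4*n0, ... until they cover total_slots."""
    if n0 < 1 or total_slots < 1:
        raise ValueError("n0 and total_slots must be positive")
    horizons: List[int] = []
    length, covered = n0, 0
    while covered < total_slots:
        horizons.append(length)  # run the policy with this horizon, then restart
        covered += length
        length *= 2
    return horizons
```

Because each stage is twice as long as the previous one, only a logarithmic number of restarts occurs over any total duration.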

Note that in order to have small regret, it is crucial that 
none of the users over-estimates $U$. This is because when there 
is over-estimation, there is a finite probability of selecting 
the $U$-worst channels even upon learning the true ranks of 
the channels. Note that regret is incurred whenever a $U$-worst 
channel is selected since under perfect knowledge this channel 
would not be selected. Hence, under over-estimation, the regret 
grows linearly in the number of transmissions. 

In a nutshell, under the $\rho^{\mathrm{EST}}$ policy, the decision to increment 
the estimate $\hat{U}$ reduces to a hypothesis-testing problem with 
hypotheses $H_0$: the number of users is less than or equal to the 
current estimate, and $H_1$: the number of users is greater than 
the current estimate. In order to have a sub-linear regret, 
the false-alarm probability (deciding $H_1$ under $H_0$) needs to 
decay asymptotically. This is ensured by selecting appropriate 
thresholds $\xi(n; \hat{U})$ to test against the collision counts obtained 
through feedback. 

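Conditional Regret: We now give the result for the first 
part. Define the "good event" $\mathcal{C}(n; U)$ that none of the users 
over-estimates $U$ under $\rho^{\mathrm{EST}}$ as 

$$\mathcal{C}(n; U) := \Big\{ \bigcap_{j=1}^{U} \hat{U}_j(n) \le U \Big\}. \qquad (22)$$

The regret conditioned on $\mathcal{C}(n; U)$, denoted by 
$R(n; \mu, U, \rho^{\mathrm{EST}}) \,|\, \mathcal{C}(n; U)$, is given by 

$$n \sum_{k=1}^{U} \mu(k^*) - \sum_{i=1}^{C} \sum_{j=1}^{U} \mu(i)\, \mathbb{E}[V_{i,j}(n) \,|\, \mathcal{C}(n; U)],$$

where $V_{i,j}(n)$ is the number of times that user $j$ is the sole 
user of channel $i$. Similarly, we have the conditional expectations 
$\mathbb{E}[T_{i,j}(n) \,|\, \mathcal{C}(n; U)]$ and the conditional expected number of collisions in the 
$U$-best channels, given by $\mathbb{E}[M(n) \,|\, \mathcal{C}(n; U)]$. We now show that 
the regret conditioned on $\mathcal{C}(n; U)$ is $O(\max(\xi^*(n; U), \log n))$. 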

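Lemma 4 (Conditional Regret): When all the $U$ secondary 
users implement the $\rho^{\mathrm{EST}}$ policy, we have for all $i \in U$-worst 
channels and each user $j = 1, \ldots, U$, 

$$\mathbb{E}[T_{i,j}(n) \,|\, \mathcal{C}(n; U)] \le \sum_{k=1}^{U} \left[ \frac{8 \log n}{\Delta(i, k^*)^2} + 1 + \frac{\pi^2}{3} \right]. \qquad (23)$$

The conditional expectation of the number of collisions $M(n)$ in 
the $U$-best channels satisfies 

$$\mathbb{E}[M(n) \,|\, \mathcal{C}(n; U)] \le U \sum_{k=1}^{U} \xi(n; k) \le U^2\, \xi^*(n; U). \qquad (24)$$

From (15), we have that $R(n) \,|\, \mathcal{C}(n; U)$ is $O(\max(\xi^*(n; U), \log n))$ 
for any $n \in \mathbb{N}$. 

Proof: See Appendix F. □ 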
Probability of Over-estimation: We now prove that none 
of the users over-estimates‡ $U$ under the $\rho^{\mathrm{EST}}$ policy, i.e., the 
probability of the event $\mathcal{C}(n; U)$ in (22) approaches one as 
$n \to \infty$, when the thresholds $\xi(n; \hat{U})$ for testing against 
the collision count are chosen appropriately (see line 6 in 
Algorithm 4). Trivially, we can set $\xi(n; 1) = 1$ since a single 
collision is enough to indicate that there is more than one user. 
For any other $k > 1$, we choose functions $\xi$ satisfying 

$$\xi(n; k) = \omega(\log n), \quad \forall k > 1. \qquad (25)$$

We prove that the above condition ensures that over-estimation 
does not occur. 

‡Note that the $\rho^{\mathrm{EST}}$ policy automatically ensures that all the users do not 
under-estimate $U$, since it increments $\hat{U}$ based on the collision estimate. This 
implies that the probability of the event that all the users under-estimate $U$ 
goes to zero asymptotically. 
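Condition (25) only requires the thresholds to grow faster than $\log n$. One possible choice, shown purely as an illustration, is a power of $\log n$ with exponent greater than one; the function name `example_threshold` and the default exponent are assumptions and are not prescribed by the analysis.

```python
import math


def example_threshold(n: int, k: int, exponent: float = 2.0) -> float:
    """Hypothetical threshold: 1 for k == 1, (log n)**exponent for k > 1."""
    if n < 2 or k < 1:
        raise ValueError("need n >= 2 and k >= 1")
    if exponent <= 1.0:
        raise ValueError("exponent must exceed 1 to grow faster than log n")
    if k == 1:
        return 1.0
    return math.log(n) ** exponent


def ratio_to_log(n: int, k: int = 2) -> float:
    """Threshold divided by log n; it should diverge as n grows."""
    return example_threshold(n, k) / math.log(n)
```

With this choice the ratio to $\log n$ equals $(\log n)^{\text{exponent}-1}$, which diverges, so the threshold is $\omega(\log n)$ while keeping the resulting regret close to logarithmic.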

Recall that $T'(n; \rho^{\mathrm{EST}})$ is the number of slots where any one 
of the top-$U$ estimated ranks of the channels at some user is 
wrong under the $\rho^{\mathrm{EST}}$ policy. We show that $\mathbb{E}[T'(n)]$ is $O(\log n)$. 

Lemma 5 (Time spent with wrong estimates): The 
expected number of slots where any of the top-$U$ estimated 
ranks of the channels at any user is wrong under the $\rho^{\mathrm{EST}}$ policy 
satisfies 

$$\mathbb{E}[T'(n)] \le U \sum_{a=1}^{U} \sum_{b=a+1}^{C} \left[ \frac{8 \log n}{\Delta(a^*, b^*)^2} + 1 + \frac{\pi^2}{3} \right]. \qquad (26)$$

Proof: The proof is on the lines of Lemma 3. □ 
Recall the definition of $T(U, U)$ in the previous section 
as the maximum time to absorption starting from any initial 
distribution of the finite-state Markov chain, where the states 
correspond to different user configurations and the absorbing 
state corresponds to the collision-free configuration. We now 
generalize the definition to $T(U, k)$, the time to absorption 
in a new Markov chain, where the state space is the set of 
configurations of $U$ users in $k$ channels, and the transition 
probabilities are defined on similar lines. Note that $T(U, k)$ 
is almost-surely finite when $k \ge U$ and $\infty$ otherwise (since 
there is no absorbing state in the latter case). 

We now bound the maximum value of the collision count 
$\Phi_{k,j}(m)$ under the $\rho^{\mathrm{EST}}$ policy in (20) using $T'(m)$, the total time 
spent with wrong channel estimates, and $T(U, k)$, the time to 
absorption in the Markov chain. Let $\overset{st}{\le}$ denote the stochastic 
order for two random variables [25]. 

Proposition 2: The maximum collision count in (20) over 
all users under the $\rho^{\mathrm{EST}}$ policy satisfies 

$$\max_{j=1,\ldots,U} \Phi_{k,j}(m) \overset{st}{\le} (T'(m) + 1)\, T(U, k), \quad \forall m \in \mathbb{N}. \qquad (27)$$

Proof: The proof is on the lines of Theorem 3. See 
Appendix G. □ 

We now prove that the probability of over-estimation goes 
to zero asymptotically. 

Lemma 6 (No Over-estimation Under $\rho^{\mathrm{EST}}$): For threshold 
functions $\xi$ satisfying (25), the event $\mathcal{C}(n; U)$ in (22) satisfies 

$$\lim_{n \to \infty} \mathbb{P}[\mathcal{C}(n; U)] = 1, \qquad (28)$$

and hence, none of the users over-estimates $U$ under the $\rho^{\mathrm{EST}}$ 
policy. 

Proof: See Appendix H. □ 
We now give the main result of this section: $\rho^{\mathrm{EST}}$ has 
slightly more than logarithmic regret asymptotically, and this 
depends on the threshold function $\xi^*(n; U)$ in (21). 

Theorem 5 (Asymptotic Regret Under $\rho^{\mathrm{EST}}$): With threshold 
functions $\xi$ satisfying the conditions in (25), the policy 
$\rho^{\mathrm{EST}}(n, C, g_j(m), \xi)$ in Algorithm 4 satisfies 

$$\limsup_{n \to \infty} \frac{R(n; \mu, U, \rho^{\mathrm{EST}})}{\xi^*(n; U)} < \infty. \qquad (29)$$

Proof: From Lemma 4 and Lemma 6. □ 
Hence, the regret under the proposed $\rho^{\mathrm{EST}}$ policy is 
$O(\xi^*(n; U))$ under a fully decentralized setting without 
knowledge of the number of users when $\xi^*(n; U) = \omega(\log n)$. 
Hence, $O(f(n) \log n)$ regret is achievable for all functions 
$f(n) \to \infty$ as $n \to \infty$. The question of whether logarithmic 
regret is possible under an unknown number of users is of interest. 

Note the difference between the $\rho^{\mathrm{EST}}$ policy in Algorithm 4 
under an unknown number of users and the $\rho^{\mathrm{RAND}}$ policy with a known 
number of users in Algorithm 3. The regret under $\rho^{\mathrm{EST}}$ is 
$O(f(n) \log n)$ for any function $f(n) = \omega(1)$, while it is 
$O(\log n)$ under the $\rho^{\mathrm{RAND}}$ policy. Hence, we are able to quantify 
the degradation of performance when the number of users is 
unknown. 

VI. Lower Bound & Effect of Number of Users 
A. Lower Bound For Distributed Learning & Access 

We have so far designed distributed learning and access 
policies with provable bounds on regret. We now discuss the 
relative performance of these policies, compared to the optimal 
learning and access policies. This is accomplished by noting 
a lower bound on regret for any uniformly-good policy, first 
derived in [4] for a general class of uniformly-good 
time-division policies. We restate the result below. 

Theorem 6 (Lower Bound [4]): For any uniformly good 
distributed learning and access policy $\rho$, the sum regret in 
(3) satisfies 

$$\liminf_{n \to \infty} \frac{R(n; \mu, U, \rho)}{\log n} \ge \sum_{i \in U\text{-worst}} \sum_{j=1}^{U} \frac{\Delta(U^*, i)}{D(\mu_i, \mu_{j^*})}. \qquad (30)$$



The lower bound derived in [9] for centralized learning and 
access holds for distributed learning and access considered 
here. But a better lower bound is obtained above by 
considering the distributed nature of learning. The lower bound for 
distributed policies is worse (i.e., larger) than the bound for the centralized 
policies in (11). This is because each user independently learns 
the channel availabilities $\mu$ in a distributed policy, whereas 
sensing decisions from all the users are used for learning in a 
centralized policy. 

Our distributed learning and access policy $\rho^{\mathrm{RAND}}$ matches 
the lower bound on regret in (30) in the order ($\log n$), but the 
scaling factors are different. It is not clear if the regret lower 
bound in (30) can be achieved by any policy under no explicit 
information exchange; this is a topic for future investigation. 

B. Behavior with Number of Users 

We have so far analyzed the sum regret under our policies 
under a fixed number of users $U$. We now analyze the behavior 
of regret growth as $U$ increases while keeping the number of 
channels $C \ge U$ fixed. 



Theorem 7 (Varying Number of Users): When the number 
of channels $C$ is fixed and the number of users $U \le C$ is 
varied, the sum regret under centralized learning and access 
$\rho^{\mathrm{CENT}}$ in (12) decreases as $U$ increases, while the upper bound 
on the sum regret under $\rho^{\mathrm{RAND}}$ in (15) monotonically increases 
with $U$. 

Proof: The proof involves analysis of (12) and (15). To 
prove that the sum regret under centralized learning and access 
in (12) decreases with the number of users $U$, it suffices to 
show that for $i \in U$-worst channels, 

$$\frac{\Delta(U^*, i)}{D(\mu_i, \mu_{U^*})}$$

decreases as $U$ increases. Note that $\mu(U^*)$ and $D(\mu_i, \mu_{U^*})$ 
decrease as $U$ increases. Hence, it suffices to show that 

$$\frac{\mu(U^*)}{D(\mu_i, \mu_{U^*})}$$

decreases with $U$. This is true since its derivative with respect 
to $U$ is negative. 

For the upper bound on regret under $\rho^{\mathrm{RAND}}$ in (15), when $U$ 
is increased, the number of $U$-worst channels decreases and 
hence, the first term in (15) decreases. However, the second 
term consisting of collisions $M(n)$ increases to a far greater 
extent. □ 

Note that the above result is for the upper bound on regret 
under the $\rho^{\mathrm{RAND}}$ policy and not the regret itself. Simulations in 
Section VII reveal that the actual regret also increases with $U$. 
Under the centralized scheme $\rho^{\mathrm{CENT}}$, as $U$ increases, the number 
of $U$-worst channels decreases. Hence, the regret decreases, 
since there are fewer possibilities of making bad 
decisions. However, for distributed schemes, although this 
effect exists, it is far outweighed by the increase in regret 
due to the increase in collisions among the $U$ users. 

In contrast, the distributed lower bound in (30) displays 
anomalous behavior with $U$ since it fails to account for 
collisions among the users. Here, as $U$ increases there are 
two competing effects: a decrease in regret due to a decrease 
in the number of $U$-worst channels and an increase in regret 
due to an increase in the number of users visiting these $U$-worst 
channels. 

VII. Numerical Results 

We present simulations that vary the schemes and the 
number of users and channels to verify the performance of 
the algorithms detailed earlier. We consider $C = 9$ channels 
(or a subset of them when the number of channels is varying) 
with probabilities of availability characterized by Bernoulli 
distributions with evenly spaced parameters ranging from 0.1 
to 0.9. 
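The setup above can be written down concretely. The sketch below builds the nine mean availabilities, splits the channels into the $U$-best and $U$-worst sets for $U = 4$, and computes the ideal per-slot throughput $\sum_{k=1}^{U} \mu(k^*)$ against which regret is measured. The helper names `split_channels` and `ideal_throughput_per_slot` are hypothetical.

```python
from typing import List, Sequence, Tuple


def split_channels(mu: Sequence[float], num_users: int) -> Tuple[List[int], List[int]]:
    """Return (indices of the U-best channels, indices of the U-worst channels)."""
    order = sorted(range(len(mu)), key=lambda i: mu[i], reverse=True)
    return order[:num_users], order[num_users:]


def ideal_throughput_per_slot(mu: Sequence[float], num_users: int) -> float:
    """Expected successful transmissions per slot under perfect knowledge."""
    best, _ = split_channels(mu, num_users)
    return sum(mu[i] for i in best)


# Evenly spaced Bernoulli parameters 0.1, 0.2, ..., 0.9 for C = 9 channels.
mu = [round(0.1 * k, 1) for k in range(1, 10)]

if __name__ == "__main__":
    best, worst = split_channels(mu, 4)
    print("U-best:", best, "U-worst:", worst)
    print("ideal throughput per slot:", ideal_throughput_per_slot(mu, 4))
```

For $U = 4$ the four best channels have availabilities 0.9, 0.8, 0.7 and 0.6, so the ideal system throughput is about 3 successful transmissions per slot.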

Comparison of Different Schemes: Fig. 2a compares the 
regret under the centralized and random allocation schemes in 
a scenario with $U = 4$ cognitive users vying for access to the 
$C = 9$ channels. The theoretical lower bounds for the regret 
in the centralized case from Theorem 2 and the distributed 
case from Theorem 6 are also plotted. The upper bound on 
the random allocation scheme from Theorem 4 is not plotted 
here, since the bound is loose, especially as the number of 
users $U$ increases. Finding tight upper bounds is a subject of 
future study. 

Fig. 2. Simulation Results. Probability of Availability $\mu = [0.1, 0.2, \ldots, 0.9]$. 
(a) Normalized regret $R(n)/\log n$ vs. $n$ slots. $U = 4$ users, $C = 9$ channels. Curves: Random Allocation Scheme, Central Allocation Scheme, Distributed Lower Bound, Centralized Lower Bound. 
(b) Normalized regret $R(n)/\log n$ vs. $n$ slots. $U = 4$ users, $C = 9$ channels. Curves: $\rho^{\mathrm{RAND}}$ with $g^{\mathrm{MEAN}}$, $\rho^{\mathrm{RAND}}$ with $g^{\mathrm{OPT}}$, Distributed Lower Bound. 
(c) No. of collisions $M(n)$ vs. $n$ slots. $U = 4$ users, $C = 9$ channels, $\rho^{\mathrm{RAND}}$ policy. Curves: $\mu$ unknown; $\mu$ known: $U\,\mathbb{E}[T(U, U)]$; upper bound on $U\,\mathbb{E}[T(U, U)]$. 
As expected, centralized allocation has the least regret. 
Another important observation is the gap between the lower 
bounds on the regret and the actual regret in both the 
distributed and the centralized cases. In the centralized scenario, 
this is simply due to using the $g^{\mathrm{MEAN}}$ statistic in (4) instead 
of the optimal statistic in (5). However, in the distributed 
case, there is an additional gap since we do not account for 
collisions among the users. Hence, the schemes under 
consideration are $O(\log n)$ and achieve order optimality although 
they are not optimal in the scaling constant. 

Performance with Varying U and C: Fig. 3a explores the 
impact of increasing the number of secondary users $U$ on the 
regret experienced by the different policies while fixing the 
number of channels $C$. With increasing $U$, the regret decreases 
for the centralized schemes and increases for the distributed 
schemes, as predicted in Theorem 7. The monotonic increase 
of regret under random allocation $\rho^{\mathrm{RAND}}$ is a result of the 
increase in the collisions as $U$ increases. The monotonic 
decrease in the centralized case, in contrast, arises because as the 
number of users increases, the number of $U$-worst channels 
decreases, resulting in lower regret. Also, the lower bound 
for the distributed case in (30) initially increases and then 
decreases with $U$. This is because as $U$ increases there are 
two competing effects: a decrease in regret due to a decrease in 
the number of $U$-worst channels and an increase in regret due to 
an increase in the number of users visiting these $U$-worst channels. 



Fig. 3b evaluates the performance of the different algorithms 
as the number of channels $C$ is varied while fixing the number 
of users $U$. The probability of availability of each additional 
channel is set higher than those already present. Here, the 
regret monotonically increases with $C$ in all cases. When the 
number of channels increases along with the quality of the 
channels, the regret increases as a result of an increase in the 
number of $U$-worst channels as well as the increasing gap in 
quality between the $U$-best and $U$-worst channels. 

Also, the situation where the ratio $U/C$ is fixed to be 0.5 
and both the number of users and channels along with their 
quality increase is considered in Fig. 3c. As the number of 
users increases, the regret increases as the number of channels 
$C$ and their quality are both increasing. Once again, this is 
in agreement with theory as the number of $U$-worst channels 
increases as $U$ and $C$ increase while keeping $U/C$ fixed. 

Collisions and Learning: Fig. 2c verifies the logarithmic 
nature of the number of collisions under the random allocation 
scheme $\rho^{\mathrm{RAND}}$. Additionally, we also plot the number of 
collisions under $\rho^{\mathrm{RAND}}$ in the ideal scenario when the channel 
availability statistics $\mu$ are known to see the effect of learning 
on the number of collisions. The low value of the number 
of collisions obtained under known channel parameters in 
the simulations is in agreement with theoretical predictions, 
analyzed as $U\,\mathbb{E}[T(U, U)]$ in Lemma 2. As the number of 
slots $n$ increases, the gap between the number of collisions 
under the known and unknown parameters increases since the 
former converges to a finite constant while the latter grows as 
$O(\log n)$. The logarithmic behavior of the cumulative number 
of collisions can be inferred from Fig. 2a. However, the curve 
in Fig. 2c for the unknown parameter case appears linear in $n$ 
due to the small value of $n$. 

Fig. 4. Simulation Results. Probability of Availability $\mu = [0.1, 0.2, \ldots, 0.9]$. 
No. of slots where user has best channel vs. user. $U = 4$, 
$C = 9$, $n = 2500$ slots, 1000 runs, $\rho^{\mathrm{RAND}}$. 

Difference between $g^{\mathrm{OPT}}$ and $g^{\mathrm{MEAN}}$: Since the statistic $g^{\mathrm{MEAN}}$ 
used in the schemes in this paper differs from the optimal 
statistic $g^{\mathrm{OPT}}$ in (5), a simulation is done to compare the 
performance of the schemes under both the statistics. As expected, in 
Fig. 2b the optimal scheme has better performance. However, 
the use of $g^{\mathrm{MEAN}}$ enables us to provide finite-time bounds, as 
described earlier. 

Fairness: One of the important features of $\rho^{\mathrm{RAND}}$ is that 
it does not favor any one user over another. Each user has 
an equal chance of settling down in any one of the $U$-best 
channels. Fig. 4 evaluates the fairness characteristics of $\rho^{\mathrm{RAND}}$. 
The simulation assumes $U = 4$ cognitive users vying for 
access to $C = 9$ channels. The graph depicts which user 
asymptotically gets the best channel over 1000 runs of the 
random allocation scheme. As can be seen, each user has 
approximately the same frequency of being allotted the best 
channel, indicating that the random allocation scheme is indeed 
fair. 
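The symmetry argument behind this fairness property can be checked with a small experiment. The sketch below is a simplified stand-in for the simulation described above, not a reproduction of it: it assumes every user already knows the correct channel order (no learning), starts each user at a uniformly random rank, and lets colliding users redraw ranks until all ranks are distinct. The function names and the seed are hypothetical.

```python
import random
from collections import Counter
from typing import List


def settle_ranks(num_users: int, rng: random.Random) -> List[int]:
    """Colliding users redraw a rank uniformly until all ranks are distinct."""
    ranks = [rng.randrange(num_users) for _ in range(num_users)]
    while len(set(ranks)) < num_users:
        counts = Counter(ranks)
        # Only users that share a rank (i.e., collide) draw a new rank.
        ranks = [rng.randrange(num_users) if counts[r] > 1 else r for r in ranks]
    return ranks


def best_channel_frequencies(num_users: int, runs: int, seed: int = 0) -> List[int]:
    """Count, per user, how often that user ends up on the best channel."""
    rng = random.Random(seed)
    wins = [0] * num_users
    for _ in range(runs):
        ranks = settle_ranks(num_users, rng)
        wins[ranks.index(0)] += 1  # rank 0 denotes the best channel
    return wins


if __name__ == "__main__":
    print(best_channel_frequencies(4, 1000))
```

Since the rule treats all users identically, each of the four users should obtain the best channel in roughly a quarter of the runs.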



VIII. Conclusion 

In this paper, we proposed novel policies for distributed 
learning of channel availability statistics and channel access 
of multiple secondary users in a cognitive network. The first 
policy assumed that the number of secondary users in the 
network is known, while the second policy removed this 
requirement. We provide provable guarantees for our policies 
in terms of sum regret. Combined with the lower bound on 
regret for any uniformly-good learning and access policy, our 
first policy achieves order-optimal regret while our second 
policy is also nearly order optimal. Our analysis in this paper 
provides insights on incorporating learning and distributed 
medium access control in a practical cognitive network. 

The results of this paper open up an interesting array of 
problems for future investigation. Our assumptions of an i.i.d. 
model for primary user transmissions and perfect sensing at 
the secondary users need to be relaxed. Our policy allows for 
an unknown but fixed number of secondary users, and it is of 
interest to incorporate users dynamically entering and leaving 



the system. Moreover, our model ignores dynamic traffic at 
the secondary nodes, and an extension to a queueing-theoretic 
formulation is desirable. 

Appendix 

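A. Proof of Theorem 2 

The result involves extending the results of [11, Thm. 1]. 
Define $T_i(n) := \sum_{j=1}^{U} T_{i,j}(n)$ as the number of times 
a channel $i$ is sensed in $n$ rounds for all users. We will show 
that 

$$\mathbb{E}[T_i(n)] \le \sum_{k \in U\text{-best}} \left[ \frac{8 \log n}{\Delta(k^*, i)^2} + 1 + \frac{\pi^2}{3} \right], \quad \forall i \in U\text{-worst}. \qquad (31)$$

We have 

$$\mathbb{P}[\text{Tx. in } i \text{ in } n\text{-th slot}] = \mathbb{P}[g(U^*; n) \le g(i; n)]$$
$$= \mathbb{P}[\mathcal{A}(i; n) \cap (g(U^*; n) \le g(i; n))] + \mathbb{P}[\mathcal{A}^c(i; n) \cap (g(U^*; n) \le g(i; n))],$$

where 

$$\mathcal{A}(i; n) := \bigcup_{k \in U\text{-best}} (g(k; n) \le g(i; n))$$

is the event that at least one of the $U$-best channels has a 
$g$-statistic less than that of channel $i$. Hence, from the union bound we have 

$$\mathbb{P}[\mathcal{A}(i; n)] \le \sum_{k \in U\text{-best}} \mathbb{P}[g(k; n) \le g(i; n)].$$

We have for $C > U$, 

$$\mathbb{P}[\mathcal{A}^c(i; n) \cap (g(U^*; n) \le g(i; n))] = 0.$$

Hence, 

$$\mathbb{P}[\text{Tx. in } i \text{ in } n\text{-th round}] \le \sum_{k \in U\text{-best}} \mathbb{P}[g(k; n) \le g(i; n)].$$

On the lines of [11, Thm. 1], we have $\forall k, i$ : 
$k$ is $U$-best, $i$ is $U$-worst, 

$$\sum_{l=1}^{n} \mathbb{I}[g(k; l) \le g(i; l)] \le \frac{8 \log n}{\Delta(k^*, i)^2} + 1 + \frac{\pi^2}{3}.$$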

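Hence, we have (31). For the bound on regret, we can break 
$R$ in (3) into two terms 

$$R(n; \mu, U, \rho^{\mathrm{CENT}}) = \sum_{i \in U\text{-worst}} \Big[ \frac{1}{U} \sum_{l=1}^{U} \Delta(l^*, i) \Big] \mathbb{E}[T_i(n)] + \sum_{i \in U\text{-best}} \Big[ \frac{1}{U} \sum_{l=1}^{U} \Delta(l^*, i) \Big] \mathbb{E}[T_i(n)].$$

For the second term, we have 

$$\sum_{i \in U\text{-best}} \Big[ \frac{1}{U} \sum_{l=1}^{U} \Delta(l^*, i) \Big] \mathbb{E}[T_i(n)] \le \mathbb{E}[T^*(n)] \sum_{i \in U\text{-best}} \Big[ \frac{1}{U} \sum_{l=1}^{U} \Delta(l^*, i) \Big] = 0,$$

where $T^*(n) := \max_{i \in U\text{-best}} T_i(n)$. Hence, we have the bound. □ 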

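B. Proof of Proposition 1 

For convenience, let $T_i(n) := \sum_{j=1}^{U} T_{i,j}(n)$ and 
$V_i(n) := \sum_{j=1}^{U} V_{i,j}(n)$. Note that $\sum_{i=1}^{C} T_i(n) = nU$, since each user 
selects one channel for sensing in each slot and there are $U$ 
users. From (3), 

$$R(n) = n \sum_{i=1}^{U} \mu(i^*) - \sum_{i=1}^{C} \mu(i)\, \mathbb{E}[V_i(n)]$$
$$\le \sum_{i \in U\text{-best}} \mu(i)\, (n - \mathbb{E}[V_i(n)])$$
$$\le \mu(1^*) \sum_{i \in U\text{-best}} (n - \mathbb{E}[V_i(n)]) \qquad (32)$$
$$= \mu(1^*) \Big( \mathbb{E}[M(n)] + \sum_{i \in U\text{-worst}} \mathbb{E}[T_i(n)] \Big), \qquad (33)$$

where Eqn. (32) uses the fact that $V_i(n) \le n$, since the total number 
of sole occupancies in $n$ slots of channel $i$ is at most $n$, and 
Eqn. (33) uses the fact that $M(n) = \sum_{i \in U\text{-best}} (T_i(n) - V_i(n))$. 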
For the lower bound, since each user selects one channel 
for sensing in each slot, j ("-) — Now 



f]f]f]A([/*,z)E[r,,(n)] 

k=l j = l i=l 



j — l 2 GC- worst 



□ 



C. Proof of Lemma 2 

Although we could directly compute the time to absorption 
of the Markov chain, we give a simple bound on $\mathbb{E}[T(U, U)]$ by 
considering an i.i.d. process over the same state space. We term 
this process a genie-aided modification of the random allocation 
scheme, since it can be realized as follows: in each slot, a 
genie checks if any collision occurred, in which case a new 
random variable is drawn from $\mathrm{Unif}(U)$ by all users. This is 
in contrast to the original random allocation scheme, where a 
new random variable is drawn only when the particular user 
experiences a collision. Note that for $U = 2$ users, the two 
scenarios coincide. 

For the genie-aided scheme, the expected number of slots to 
hit orthogonality is just the mean of the geometric distribution 

$$\sum_{k=1}^{\infty} k (1 - p)^k p = \frac{1 - p}{p} < \infty, \qquad (34)$$

where $p$ is the probability of having an orthogonal 
configuration in a slot. This is in fact the reciprocal of the number of 
compositions of $U$ [24, Thm. 5.1], given by 

$$p = \binom{2U - 1}{U}^{-1}. \qquad (35)$$

The above expression is nothing but the reciprocal of the number 
of ways $U$ identical balls (users) can be placed in $U$ different 
bins (channels): there are $2U - 1$ possible positions to form 
$U$ partitions of the balls. 
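The balls-in-bins count quoted above can be confirmed by brute-force enumeration for small $U$. In the sketch below, each placement of $U$ identical users into $U$ channels is represented as a non-decreasing tuple of channel indices; the function name `count_occupancy_patterns` is hypothetical.

```python
from itertools import combinations_with_replacement
from math import comb


def count_occupancy_patterns(num_users: int) -> int:
    """Enumerate the ways to place U identical users into U distinct channels."""
    channels = range(num_users)
    # Each multiset of size U drawn from the U channels is one placement.
    return sum(1 for _ in combinations_with_replacement(channels, num_users))


if __name__ == "__main__":
    for u in range(1, 7):
        print(u, count_occupancy_patterns(u), comb(2 * u - 1, u))
```

The enumerated count and $\binom{2U-1}{U}$ agree for every $U$ tried, e.g. 3 placements for $U = 2$ and 35 for $U = 4$.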

Now for the random allocation scheme without the genie, 
any user not experiencing collision does not draw a new 
variable from $\mathrm{Unif}(U)$. Hence, the number of possible 
configurations in any slot is lower than under the genie-aided scheme. 
Since there is only one configuration satisfying orthogonality§, 
the probability of orthogonality increases in the absence of the 
genie and is at least (35). Hence, the number of slots to reach 
orthogonality without the genie is at most (34). Since in any 
slot at most $U$ collisions occur, (17) holds. □ 
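A small Monte Carlo experiment can be used to compare the random allocation scheme (without the genie) against the bound in (17). The sketch below assumes that every user knows the correct channel order, starts each user at a uniformly random rank, and counts one collision per colliding user per slot, which is at most $U$ per slot as in the proof. These modelling choices and all function names are assumptions made for illustration.

```python
import random
from collections import Counter
from math import comb


def collisions_until_orthogonal(num_users: int, rng: random.Random) -> int:
    """Total collisions until all users hold distinct ranks."""
    ranks = [rng.randrange(num_users) for _ in range(num_users)]
    total = 0
    while True:
        counts = Counter(ranks)
        colliding = [j for j, r in enumerate(ranks) if counts[r] > 1]
        if not colliding:
            return total
        total += len(colliding)
        for j in colliding:  # only colliding users redraw (no genie)
            ranks[j] = rng.randrange(num_users)


def mean_collisions(num_users: int, runs: int, seed: int = 0) -> float:
    """Empirical mean of the collision count over independent runs."""
    rng = random.Random(seed)
    samples = (collisions_until_orthogonal(num_users, rng) for _ in range(runs))
    return sum(samples) / runs


if __name__ == "__main__":
    u = 3
    print(mean_collisions(u, 2000), "vs bound", u * (comb(2 * u - 1, u) - 1))
```

In such runs the empirical mean stays well below the closed-form bound, in line with the observation that the bound is loose.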

D. Proof of Lemma 3 

Let $c_{n,m} := \sqrt{\dfrac{2 \log n}{m}}$. 

Case 1: Consider $U = C = 2$ first. Let 

^(t,0:={5r'(l*;^ - 1) ^ 5r™(2*;i - - 1) > I}. 

On lines of HI] Thm. 1], 



T'{n) <l + J2liAtJ)], 

t=2 

oo t 
t=l m-\-h=l 

The above event is implied by 

Xi.j{h) + Ct,h < X2-\j{h) + Ct,h+m 
since Ct,ni > Ct^h+m- 

The above event implies at least one of the following events 
and hence, we can use the union bound. 

Hi- < ^2* + 2ct,h+m- 

From the Chernoff-Hoeffding bound, 

P[^l-,,W <Ml* ~CtM] <t-\ 

and the event that /^i. < ^2* + 2ct,h+m implies that 

Slogt 



m < 



^l*,2* 



Since 



EEE2^-^-T' 



t=l m=l h=l 
§Since all users are identical for this analysis. 

Case 2: For $\min(U, C) > 2$, we have 

$$T'(n) \le U \sum_{a=1}^{U} \sum_{b=a+1}^{C} \sum_{m=1}^{n} \mathbb{I}[g_j^{\mathrm{MEAN}}(a^*; m) < g_j^{\mathrm{MEAN}}(b^*; m)],$$

where $a^*$ and $b^*$ represent channels with the $a$-th and $b$-th highest 
availabilities. On lines of the result for $U = C = 2$, we can 
show that 

$$\sum_{m=1}^{n} \mathbb{E}\,\mathbb{I}[g_j^{\mathrm{MEAN}}(a^*; m) < g_j^{\mathrm{MEAN}}(b^*; m)] \le \frac{8 \log n}{\Delta(a^*, b^*)^2} + 1 + \frac{\pi^2}{3}.$$

Hence, (18) holds. □ 

E. Proof of Theorem 3 

Define the good event as all users having the correct top-$U$ 
order of the $g$-statistics, given by 

$$\mathcal{G}(n) := \bigcap_{j=1}^{U} \{\text{Top-}U \text{ entries of } g_j(n) \text{ are same as in } \mu\}.$$

The number of slots under the bad event is 

$$\sum_{m=1}^{n} \mathbb{I}[\mathcal{G}^c(m)] = T'(n),$$

by definition of $T'(n)$. In each slot, either a good or a bad 
event occurs. Let $\gamma$ be the total number of collisions in the 
$U$-best channels between two bad events, i.e., under a run of 
good events. In this case, all the users have the correct top-$U$ 
ranks of channels and hence, 

$$\mathbb{E}[\gamma \,|\, \mathcal{G}(n)] \le U\,\mathbb{E}[T(U, U)] < \infty,$$

where $\mathbb{E}[T(U, U)]$ is given by (17). Hence, each transition 
from the bad to the good state results in at most $U\,\mathbb{E}[T(U, U)]$ 
expected number of collisions in the $U$-best channels. The 
expected number of collisions under the bad event is at most 
$U\,\mathbb{E}[T'(n)]$. Hence, (19) holds. □ 



F. Proof of Lemma 4 

Under $\mathcal{C}(n; U)$, a $U$-worst channel is sensed only if it is 
mistaken to be a $U$-best channel. Hence, on lines of Lemma 1, 

$$\mathbb{E}[T_{i,j}(n) \,|\, \mathcal{C}(n; U)] = O(\log n), \quad \forall i \in U\text{-worst}, \; j = 1, \ldots, U.$$

For the number of collisions $M(n)$ in the $U$-best channels, 
there can be at most $U \sum_{k=1}^{a} \xi(n; k)$ collisions in the $U$-best 
channels, where $a := \max_{j=1,\ldots,U} \hat{U}_j$ is the maximum estimate 
of the number of users. Conditioned on $\mathcal{C}(n; U)$, $a \le U$, and 
hence, we have (24). □ 



G. Proof of Proposition 2 

Define the good event as all users having the correct top-$U$ 
order, given by 

$$\mathcal{G}(n) := \bigcap_{j=1}^{U} \{\text{Top-}U \text{ entries of } g_j(n) \text{ are same as in } \mu\}.$$

The number of slots under the bad event is 

$$\sum_{m=1}^{n} \mathbb{I}[\mathcal{G}^c(m)] = T'(n),$$

by definition of $T'(n)$. In each slot, either a good or a bad 
event occurs. Let $\gamma$ be the total number of collisions in the $k$-best 
channels between two bad events, i.e., under a run of good 
events. In this case, all the users have the correct top-$U$ ranks 
of channels and hence, 

$$\gamma \,|\, \mathcal{G}(n) \overset{st}{\le} U\, T(U, k).$$

The number of collisions under the bad event is at most $T'(n)$. 
Hence, (27) holds. □ 

H. Proof of Lemma 6 

We are interested in 

$$\mathbb{P}[\mathcal{C}^c(n; U)] = \mathbb{P}\Big[\bigcup_{j=1}^{U} \hat{U}_j(n) > U\Big]$$
$$= \mathbb{P}\Big[\bigcup_{m=1}^{n} \bigcup_{j=1}^{U} \{\Phi_{U,j}(m) > \xi(n; U)\}\Big]$$
$$= \mathbb{P}\Big[\max_{j=1,\ldots,U} \Phi_{U,j}(n) > \xi(n; U)\Big],$$

where $\Phi$ is given by (20). For $U = 1$, we have $\mathbb{P}[\mathcal{C}^c(n; U)] = 0$ 
since no collisions occur. 
Using (27) in Proposition 2, 

$$\mathbb{P}\Big[\max_{j=1,\ldots,U} \Phi_{k,j}(n) > \xi(n; k)\Big] \le \mathbb{P}[k\, T(U, k)(T'(n) + 1) > \xi(n; k)]$$
$$\le \mathbb{P}\Big[k (T'(n) + 1) > \frac{\xi(n; k)}{\alpha_n}\Big] + \mathbb{P}[T(U, k) > \alpha_n]$$
$$\le \frac{k\, \alpha_n (\mathbb{E}[T'(n)] + 1)}{\xi(n; k)} + \mathbb{P}[T(U, k) > \alpha_n], \qquad (36)$$

using the Markov inequality. By choosing $\alpha_n = \omega(1)$, the second 
term in (36), viz., $\mathbb{P}[T(U, k) > \alpha_n] \to 0$ as $n \to \infty$, for 
$k \ge U$. For the first term, from (26) in Lemma 5, $\mathbb{E}[T'(n)] = 
O(\log n)$. Hence, by choosing $\alpha_n = o(\xi^*(n; k)/\log n)$, the 
first term decays to zero. Since $\xi^*(n; U) = \omega(\log n)$, we can 
choose $\alpha_n$ satisfying both the conditions. By letting $k = U$ in 
(36), we have $\mathbb{P}[\mathcal{C}^c(n; U)] \to 0$ as $n \to \infty$, and (28) holds. □ 